Sunah Park

This R Markdown file was created by Sunah Park for extended lab exercises from the book An Introduction to Statistical Learning with Applications in R (ISLR).


Setup for code chunks

rm(list=ls())
# default r markdown global options in document
knitr::opts_chunk$set(
   ########## Text results ##########
    echo=TRUE, 
    warning=FALSE, # suppress warnings in the output 
    error=FALSE, # stop knitting if a chunk throws an error (errors are not shown in the output)
    message=FALSE, # suppress messages in the output
    strip.white=TRUE, # remove blank lines at the beginning or end of a source chunk in the output 

    ########## Cache ##########
    cache=TRUE,
   
    ########## Plots ##########
    fig.path="", # prefix to be used for figure filenames
    fig.width=8,
    fig.height=6,
    dpi=200
)

library(ISLR)
library(e1071) # svm() function
library(ggplot2)
library(caret)
library(ROCR) # rocplot() function

1. Support Vector Classifier

set.seed(1)
x<-matrix(rnorm(20*2), ncol=2)
y<-c(rep(-1,10), rep(1,10))
x[y==1,]=x[y==1,]+1

plot(x, col=(3-y)) # check whether the classes are linearly separable -- they are not

dat<-data.frame(x=x, y=as.factor(y))
svmfit<-svm(y~., data=dat, kernel="linear", cost=10, scale=FALSE) # kernel: linear/polynomial/radial basis/sigmoid, cost: cost of constraints violation (default:1) - "C" constant of the regularization term in the Lagrange formulation
plot(svmfit,dat) # -1: blue, 1: purple

svmfit$index
## [1]  1  2  5  7 14 16 17
summary(svmfit)
## 
## Call:
## svm(formula = y ~ ., data = dat, kernel = "linear", cost = 10, 
##     scale = FALSE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  10 
##       gamma:  0.5 
## 
## Number of Support Vectors:  7
## 
##  ( 4 3 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  -1 1

The decision boundary between the two classes is linear (because we used the argument kernel="linear"), though due to the way the plotting function is implemented in this library the boundary looks somewhat jagged in the plot. Only one observation is misclassified. The support vectors are plotted as crosses and the remaining observations as circles; there are seven of them (svmfit$index), four in one class and three in the other.

What if we instead used a smaller value of the cost parameter?

svmfit<-svm(y~., data=dat, kernel="linear", cost=0.1, scale=FALSE)
plot(svmfit,dat) # -1: blue, 1: purple

svmfit$index
##  [1]  1  2  3  4  5  7  9 10 12 13 14 15 16 17 18 20
summary(svmfit)
## 
## Call:
## svm(formula = y ~ ., data = dat, kernel = "linear", cost = 0.1, 
##     scale = FALSE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  0.1 
##       gamma:  0.5 
## 
## Number of Support Vectors:  16
## 
##  ( 8 8 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  -1 1

Because a smaller value of the cost parameter is used, we obtain a larger number of support vectors: the margin is now wider. Note that the svm() function does not explicitly output the coefficients of the linear decision boundary obtained when the support vector classifier is fit, nor does it output the width of the margin.
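For a linear kernel, though, these can be recovered by hand from the fitted object (a sketch, assuming the svmfit object from the chunk above: e1071 stores the products alpha_i*y_i in coefs, the support vectors in SV, and the negative intercept in rho):

```r
# Recover the linear decision boundary beta0 + x %*% beta = 0 from the
# support vectors (sketch; uses svmfit fit with kernel="linear" above)
beta  <- drop(t(svmfit$coefs) %*% svmfit$SV) # beta = sum_i (alpha_i*y_i)*x_i
beta0 <- -svmfit$rho
beta; beta0
2/sqrt(sum(beta^2)) # width of the margin
```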

The e1071 library includes a built-in function, tune(), to perform cross-validation (by default, 10-fold CV).

set.seed(1)
tune.out<-tune(svm, y~., data=dat, kernel="linear",
               ranges=list(cost=c(0.001,0.01,0.1,1,5,10,100))) # ranges: list of parameter vectors spanning the sampling space
summary(tune.out)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost
##   0.1
## 
## - best performance: 0.1 
## 
## - Detailed performance results:
##    cost error dispersion
## 1 1e-03  0.70  0.4216370
## 2 1e-02  0.70  0.4216370
## 3 1e-01  0.10  0.2108185
## 4 1e+00  0.15  0.2415229
## 5 5e+00  0.15  0.2415229
## 6 1e+01  0.15  0.2415229
## 7 1e+02  0.15  0.2415229
plot(tune.out)

tune.out$best.parameters
##   cost
## 3  0.1
bestmod<-tune.out$best.model
summary(bestmod)
## 
## Call:
## best.tune(method = svm, train.x = y ~ ., data = dat, ranges = list(cost = c(0.001, 
##     0.01, 0.1, 1, 5, 10, 100)), kernel = "linear")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  0.1 
##       gamma:  0.5 
## 
## Number of Support Vectors:  16
## 
##  ( 8 8 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  -1 1

We see that cost=0.1 results in the lowest cross-validation error rate. The tune() function stores the best model obtained in the best.model component.

set.seed(1)
xtest<-matrix(rnorm(20*2), ncol=2)
ytest<-sample(c(-1,1), 20, rep=TRUE)
xtest[ytest==1,]=xtest[ytest==1,]+1

testdat<-data.frame(x=xtest, y=as.factor(ytest))
ypred<-predict(bestmod, testdat)
table(ypred, testdat$y)
##      
## ypred -1  1
##    -1 10  1
##    1   1  8
confusionMatrix(ypred, testdat$y)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction -1  1
##         -1 10  1
##         1   1  8
##                                          
##                Accuracy : 0.9            
##                  95% CI : (0.683, 0.9877)
##     No Information Rate : 0.55           
##     P-Value [Acc > NIR] : 0.0009274      
##                                          
##                   Kappa : 0.798          
##  Mcnemar's Test P-Value : 1.0000000      
##                                          
##             Sensitivity : 0.9091         
##             Specificity : 0.8889         
##          Pos Pred Value : 0.9091         
##          Neg Pred Value : 0.8889         
##              Prevalence : 0.5500         
##          Detection Rate : 0.5000         
##    Detection Prevalence : 0.5500         
##       Balanced Accuracy : 0.8990         
##                                          
##        'Positive' Class : -1             
## 

With the value of cost=0.1, 18 of the test observations are correctly classified. What if we had instead used cost=0.01?

svmfit<-svm(y~., data=dat, kernel="linear", cost=0.01, scale=FALSE)
ypred<-predict(svmfit, testdat)
confusionMatrix(ypred, testdat$y)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction -1  1
##         -1 10  4
##         1   1  5
##                                          
##                Accuracy : 0.75           
##                  95% CI : (0.509, 0.9134)
##     No Information Rate : 0.55           
##     P-Value [Acc > NIR] : 0.05533        
##                                          
##                   Kappa : 0.4792         
##  Mcnemar's Test P-Value : 0.37109        
##                                          
##             Sensitivity : 0.9091         
##             Specificity : 0.5556         
##          Pos Pred Value : 0.7143         
##          Neg Pred Value : 0.8333         
##              Prevalence : 0.5500         
##          Detection Rate : 0.5000         
##    Detection Prevalence : 0.7000         
##       Balanced Accuracy : 0.7323         
##                                          
##        'Positive' Class : -1             
## 

In this case three additional observations are misclassified.

Now we consider a situation in which the two classes are linearly separable. We can then find a separating hyperplane using the svm() function. We fit the support vector classifier and plot the resulting hyperplane, using a very large value of cost so that no observations are misclassified (cost=1e5).

x[y==1,]=x[y==1,]+0.5
plot(x, col=(y+5)/2, pch=19)

dat<-data.frame(x=x, y=as.factor(y))
svmfit<-svm(y~., data=dat, kernel="linear", cost=1e5)
summary(svmfit)
## 
## Call:
## svm(formula = y ~ ., data = dat, kernel = "linear", cost = 1e+05)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1e+05 
##       gamma:  0.5 
## 
## Number of Support Vectors:  3
## 
##  ( 1 2 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  -1 1
plot(svmfit, dat)

No training errors were made and only three support vectors were used. However, we can see from the figure that the margin is very narrow (the observations that are not support vectors, shown as circles, are very close to the decision boundary). It seems likely that this model will perform poorly on test data. We now try a smaller value of cost (cost=1).

svmfit<-svm(y~., data=dat, kernel="linear", cost=1)
summary(svmfit)
## 
## Call:
## svm(formula = y ~ ., data = dat, kernel = "linear", cost = 1)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1 
##       gamma:  0.5 
## 
## Number of Support Vectors:  7
## 
##  ( 4 3 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  -1 1
plot(svmfit, dat)

Using cost=1, we misclassify a training observation, but we also obtain a much wider margin and make use of seven support vectors. It seems likely that this model will perform better on test data than the model with cost=1e5.


2. Support Vector Machine

In order to fit an SVM using a non-linear kernel, we use a different value of the kernel parameter. To fit an SVM with a polynomial kernel we use kernel="polynomial" together with the degree argument; to fit an SVM with a radial kernel we use kernel="radial" together with the gamma argument.

set.seed(1)
x<-matrix(rnorm(200*2), ncol=2)
x[1:100,]=x[1:100,]+2
x[101:150,]=x[101:150,]-2
y<-c(rep(1,150), rep(2,50))
dat<-data.frame(x=x, y=as.factor(y))
plot(x,col=(y)) # The class boundary is non-linear

train<-sample(200,100)
svmfit<-svm(y~., data=dat[train,], kernel="radial", gamma=1, cost=1)
plot(svmfit, dat[train,])

summary(svmfit)
## 
## Call:
## svm(formula = y ~ ., data = dat[train, ], kernel = "radial", 
##     gamma = 1, cost = 1)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  1 
## 
## Number of Support Vectors:  37
## 
##  ( 17 20 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  1 2

The plot shows that the resulting SVM has a decidedly non-linear boundary.

We can see from the figure that there are a fair number of training errors in this SVM fit. If we increase the value of cost, we can reduce the number of training errors. However, this comes at the price of a more irregular decision boundary that seems to be at risk of overfitting the data.

svmfit<-svm(y~., data=dat[train,], kernel="radial", gamma=1, cost=1e5)
plot(svmfit, dat[train,])

summary(svmfit)
## 
## Call:
## svm(formula = y ~ ., data = dat[train, ], kernel = "radial", 
##     gamma = 1, cost = 1e+05)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1e+05 
##       gamma:  1 
## 
## Number of Support Vectors:  26
## 
##  ( 12 14 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  1 2
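A polynomial kernel can be fit to the same training data in the same way, with the degree argument in place of gamma (a sketch; degree=3 and cost=1 are arbitrary, untuned choices):

```r
# Polynomial-kernel SVM on the same training data (untuned parameters)
svmfit.poly<-svm(y~., data=dat[train,], kernel="polynomial", degree=3, cost=1)
summary(svmfit.poly)
plot(svmfit.poly, dat[train,])
```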

We can perform cross-validation using tune() to select the best choice of gamma and cost for an SVM with a radial kernel.

set.seed(1)
tune.out<-tune(svm, y~., data=dat[train,], kernel="radial",
               ranges=list(
                   cost=c(0.1, 1, 10, 100, 1000),
                   gamma=c(0.5, 1,2,3,4)))
summary(tune.out)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost gamma
##     1     2
## 
## - best performance: 0.12 
## 
## - Detailed performance results:
##     cost gamma error dispersion
## 1  1e-01   0.5  0.27 0.11595018
## 2  1e+00   0.5  0.13 0.08232726
## 3  1e+01   0.5  0.15 0.07071068
## 4  1e+02   0.5  0.17 0.08232726
## 5  1e+03   0.5  0.21 0.09944289
## 6  1e-01   1.0  0.25 0.13540064
## 7  1e+00   1.0  0.13 0.08232726
## 8  1e+01   1.0  0.16 0.06992059
## 9  1e+02   1.0  0.20 0.09428090
## 10 1e+03   1.0  0.20 0.08164966
## 11 1e-01   2.0  0.25 0.12692955
## 12 1e+00   2.0  0.12 0.09189366
## 13 1e+01   2.0  0.17 0.09486833
## 14 1e+02   2.0  0.19 0.09944289
## 15 1e+03   2.0  0.20 0.09428090
## 16 1e-01   3.0  0.27 0.11595018
## 17 1e+00   3.0  0.13 0.09486833
## 18 1e+01   3.0  0.18 0.10327956
## 19 1e+02   3.0  0.21 0.08755950
## 20 1e+03   3.0  0.22 0.10327956
## 21 1e-01   4.0  0.27 0.11595018
## 22 1e+00   4.0  0.15 0.10801234
## 23 1e+01   4.0  0.18 0.11352924
## 24 1e+02   4.0  0.21 0.08755950
## 25 1e+03   4.0  0.24 0.10749677
tune.out$best.parameters
##    cost gamma
## 12    1     2
svm.pred<-predict(tune.out$best.model, newdata=dat[-train,]) # predict.svm takes newdata; an argument named newx would be silently ignored
table(dat[-train,"y"], svm.pred)
##    svm.pred
##      1  2
##   1 56 21
##   2 18  5
confusionMatrix(dat[-train,"y"], svm.pred)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2
##          1 56 21
##          2 18  5
##                                          
##                Accuracy : 0.61           
##                  95% CI : (0.5073, 0.706)
##     No Information Rate : 0.74           
##     P-Value [Acc > NIR] : 0.9985         
##                                          
##                   Kappa : -0.0529        
##  Mcnemar's Test P-Value : 0.7488         
##                                          
##             Sensitivity : 0.7568         
##             Specificity : 0.1923         
##          Pos Pred Value : 0.7273         
##          Neg Pred Value : 0.2174         
##              Prevalence : 0.7400         
##          Detection Rate : 0.5600         
##    Detection Prevalence : 0.7700         
##       Balanced Accuracy : 0.4745         
##                                          
##        'Positive' Class : 1              
## 

The best choice of parameters involves cost=1 and gamma=2. Beware the predict() call, however: predict.svm takes a newdata argument, not newx. If newx is supplied it is silently ignored, the predictions returned are those for the training data, and the resulting confusion matrix (such as the one above, with 39% of observations misclassified) does not measure test performance. Passing the test set through newdata yields a substantially lower test error rate.


3. ROC Curves

rocplot<-function(pred, truth, ...) {
    predob<-prediction(pred, truth)
    perf<-performance(predob, "tpr", "fpr")
    plot(perf, ...)}

SVMs and support vector classifiers output class labels for each observation. However, it is also possible to obtain fitted values for each observation, which are the numerical scores used to obtain the class labels. In essence, the sign of the fitted value determines on which side of the decision boundary the observation lies. Therefore, the relationship between the fitted value and the class prediction for a given observation is simple: if the fitted value exceeds zero then the observation is assigned to one class, and if it is less than zero then it is assigned to the other. In order to obtain the fitted values for a given SVM model fit, we use decision.values=TRUE when fitting svm(). Then the predict() function will output the fitted values.

svmfit.opt<-svm(y~., data=dat[train,], kernel="radial", gamma=2, cost=1, decision.values=TRUE) 
svm.pred1<-predict(svmfit.opt, dat[train,], decision.values=TRUE)
svm.pred2<-predict(svmfit.opt, dat[-train,], decision.values=TRUE)

fitted1<-attributes(svm.pred1)$decision.values
fitted2<-attributes(svm.pred2)$decision.values

par(mfrow=c(1,2))
rocplot(fitted1, dat[train,"y"], main="Training Data") # ROC plot
rocplot(fitted2, dat[-train,"y"], main="Test Data", col="red") # ROC plot for the test data
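The same ROCR machinery can also report the area under each curve (a sketch using the prediction() and performance() functions with the fitted values from the chunk above):

```r
# AUC from ROCR, for the training and test fitted values computed above
auc <- function(pred, truth)
    performance(prediction(pred, truth), "auc")@y.values[[1]]
auc(fitted1, dat[train,"y"])  # training AUC
auc(fitted2, dat[-train,"y"]) # test AUC
```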


4. SVM with multiple classes

If the response is a factor containing more than two levels, then the svm() function will perform multi-class classification using the one-versus-one approach.

set.seed(1)
x<-rbind(x, matrix(rnorm(50*2), ncol=2))
y<-c(y, rep(0,50))
x[y==0,2]=x[y==0,2]+2
dat<-data.frame(x=x, y=as.factor(y))
plot(x, col=y+1, pch=19)

svmfit<-svm(y~., data=dat, kernel="radial", cost=10, gamma=1)
plot(svmfit, dat)

We now examine the Khan data set, which consists of a number of tissue samples corresponding to four distinct types of small round blue cell tumors. The data set consists of training data, xtrain and ytrain, and testing data, xtest and ytest.

names(Khan)
## [1] "xtrain" "xtest"  "ytrain" "ytest"
table(Khan$ytrain)
## 
##  1  2  3  4 
##  8 23 12 20
table(Khan$ytest)
## 
## 1 2 3 4 
## 3 6 6 5
dat<-data.frame(x=Khan$xtrain, y=as.factor(Khan$ytrain))
out<-svm(y~., data=dat, kernel="linear", cost=10)
summary(out)
## 
## Call:
## svm(formula = y ~ ., data = dat, kernel = "linear", cost = 10)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  10 
##       gamma:  0.0004332756 
## 
## Number of Support Vectors:  58
## 
##  ( 20 20 11 7 )
## 
## 
## Number of Classes:  4 
## 
## Levels: 
##  1 2 3 4
table(out$fitted, dat$y)
##    
##      1  2  3  4
##   1  8  0  0  0
##   2  0 23  0  0
##   3  0  0 12  0
##   4  0  0  0 20
confusionMatrix(out$fitted, dat$y)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2  3  4
##          1  8  0  0  0
##          2  0 23  0  0
##          3  0  0 12  0
##          4  0  0  0 20
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9431, 1)
##     No Information Rate : 0.3651     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity             1.000   1.0000   1.0000   1.0000
## Specificity             1.000   1.0000   1.0000   1.0000
## Pos Pred Value          1.000   1.0000   1.0000   1.0000
## Neg Pred Value          1.000   1.0000   1.0000   1.0000
## Prevalence              0.127   0.3651   0.1905   0.3175
## Detection Rate          0.127   0.3651   0.1905   0.3175
## Detection Prevalence    0.127   0.3651   0.1905   0.3175
## Balanced Accuracy       1.000   1.0000   1.0000   1.0000

We see that there are no training errors. This is not surprising: the large number of variables relative to the number of observations means it is easy to find hyperplanes that fully separate the classes.
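The dimensions of the Khan training data bear this out: there are far more gene expression measurements than tissue samples.

```r
dim(Khan$xtrain) # 63 observations, 2308 gene expression measurements
length(Khan$ytrain)
```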

We are most interested not in the support vector classifier’s performance on the training observations, but rather its performance on the test observations.

dat<-data.frame(x=Khan$xtrain, y=as.factor(Khan$ytrain))
out<-svm(y~., data=dat, kernel="linear", cost=10)
dat.test<-data.frame(x=Khan$xtest, y=as.factor(Khan$ytest))

pred<-predict(out, newdata=dat.test)
table(pred, dat.test$y)
##     
## pred 1 2 3 4
##    1 3 0 0 0
##    2 0 6 2 0
##    3 0 0 4 0
##    4 0 0 0 5

We see that using cost=10 yields two test set errors on this data.